Fast Top-k Distance-Based Outlier Detection on Uncertain Data
Abstract
This paper studies top-k distance-based outlier detection on uncertain data, where each uncertain object is modelled by a Gaussian probability density function. We start with the Naive approach. We then introduce the populated-cell list (PC-list), a sorted list of the non-empty cells of the grid used to index the data. Using the PC-list, our top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. We also present an approximate top-k outlier detection algorithm that further improves efficiency. An extensive empirical study on synthetic and real datasets shows that the proposed approaches are efficient and scalable.
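The PC-list idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm. It makes several simplifying assumptions: each object is reduced to a single point (for example, the mean of its Gaussian), so uncertainty is ignored; an object's outlier score is its number of neighbours within `radius` (fewer means more outlying); and the PC-list is sorted by cell population in ascending order. The names `build_pc_list` and `top_k_outliers` are hypothetical.

```python
import heapq
from collections import defaultdict
from math import dist, floor, sqrt


def build_pc_list(points, cell_len):
    """Return the non-empty grid cells (lists of point indices), sparsest first."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(c / cell_len) for c in p)].append(idx)
    # Assumption: cells are ordered by population, ascending.
    return sorted(cells.values(), key=len)


def top_k_outliers(points, k, radius):
    """Return up to k (neighbour_count, index) pairs with the fewest neighbours."""
    dim = len(points[0])
    # Cell diagonal is radius / 2, so points sharing a cell are always neighbours.
    cell_len = radius / (2 * sqrt(dim))
    best = []  # max-heap on count (stored negated): the k smallest counts so far
    for members in build_pc_list(points, cell_len):
        # Each object in this cell has at least len(members) - 1 neighbours,
        # and later cells are at least as populated, so none can beat `best`.
        if len(best) == k and len(members) - 1 >= -best[0][0]:
            break
        for i in members:
            # Brute-force neighbour count, kept simple for illustration.
            count = sum(
                1
                for j, q in enumerate(points)
                if j != i and dist(points[i], q) <= radius
            )
            if len(best) < k:
                heapq.heappush(best, (-count, i))
            elif count < -best[0][0]:
                heapq.heapreplace(best, (-count, i))
    return sorted((-neg, i) for neg, i in best)


# Example data: a tight cluster of five points plus two isolated points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2),
          (5.0, 5.0), (9.0, 0.0)]
outliers = top_k_outliers(points, k=2, radius=1.0)
```

In this sketch the scan stops as soon as a cell's population alone guarantees that none of its objects can enter the current top-k, so the objects in dense cells are never scored. This mirrors the claim that only a fraction of the dataset needs to be considered.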
Similar resources
A Study on Distance-based Outlier Detection on Uncertain Data
Uncertain data management, querying and mining have become important because the majority of real-world data is now accompanied by uncertainty. Uncertainty in data is often caused by deficiencies in the underlying data-collecting equipment, or is sometimes introduced manually to preserve data privacy. The uncertainty information in the data is useful and can be used to improve the quality o...
Top-k Distance-Based Outlier Detection on Uncertain Data
This paper studies the problem of top-k distance-based outlier detection on uncertain data. In this work, an uncertain object is modelled by a Gaussian probability density function. Since the Naive approach is very expensive due to the costly distance function between uncertain objects, a populated-cell list (PC-list) based top-k distance-based outlier detection approach is proposed in this work. W...
Detecting High-Dimensional Outliers: the New Task, Algorithms and Performance
Outlier detection is a fundamental step in knowledge discovery in databases. With the increasing number of high-dimensional databases, existing outlier detection algorithms that work only in the full space are unable to effectively screen out informative outliers. This is because the majority of these outliers exist only in subspaces. In this paper, we identify a new outlier detection t...
Distance-Based Outlier Detection on Uncertain Data of Gaussian Distribution
Managing and mining uncertain data is becoming important with the increase in the use of devices responsible for generating uncertain data, for example sensors, RFIDs, etc. In this paper, we extend the notion of distance-based outliers for uncertain data. To the best of our knowledge, this is the first work on distance-based outlier detection on uncertain data of Gaussian distribution. Since th...
A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data
Detecting outliers which are grossly different from or inconsistent with the remaining dataset is a major challenge in real-world KDD applications. Existing outlier detection methods are ineffective on scattered real-world datasets due to implicit data patterns and parameter setting issues. We define a novel Local Distance-based Outlier Factor (LDOF) to measure the outlier-ness of objects in sc...